Robert Turner, University of Sheffield RSE Team, September 2021
Heavily based on Reproducible Research Data and Project Management in R by Anna Krystalli, naming things by Jenny Bryan and Methods in Research Software Engineering by David Wilby.
Mix of software engineering and research experience.
13 RSEs, 35 projects / year worth ~£11m total
Research is hard, let’s not make it harder.
PLoS Medicine
Is this true?
The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.
Practical advice on:
What operating system(s) do you use?
What programming language(s) do you use?
Some years ago, Tom Webb (@tomjwebb) asked for advice on Twitter. Some of the resulting conversation is included in this presentation…
@tomjwebb @ScientificData Talk to their librarian for data management strategies #datainfolit
— Yasmeen Shorish (@yasmeen_azadi) January 16, 2015
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— Oceans Initiative (@oceansresearch) January 16, 2015
Take initiative & responsibility. Think long term.
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
Do you agree?
@jaimedash just don’t let excel anywhere near dates or times. @tomjwebb @tpoi @larysar
— Dave Harris (@davidjayharris) January 16, 2015
THRILLED by this announcement by the Human Gene Nomenclature Committee. pic.twitter.com/BqLIOMm69d
— Janna Hutz (@jannahutz) August 4, 2020
But good for data viewing / entry, sometimes, perhaps…
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
Have a look at the Carpentries Databases and SQL lesson or SQL for Ecology lesson.
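As a rough illustration of what "enforcing data types" and "options from selection" can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names and allowed species are hypothetical, and SQLite is only one of many database systems that could be used for data entry.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE survey (
        site    TEXT NOT NULL,
        species TEXT NOT NULL CHECK (species IN ('robin', 'wren')),
        n_seen  INTEGER NOT NULL
                CHECK (typeof(n_seen) = 'integer' AND n_seen >= 0)
    )
    """
)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO survey VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(("A1", "robin", 3))
rejected_species = try_insert(("A1", "robbin", 3))     # typo, not an allowed option
rejected_count = try_insert(("A1", "wren", "three"))   # text where a number is required
n_rows = conn.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
```

Only the first row is stored; the two bad entries are refused at entry time rather than discovered during analysis.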
.csv: comma separated values.
.tsv: tab separated values.
.txt: no formatting specified.
@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
What file formats do you need to work with?
Andrea De Santis, unsplash.com
.csv or .tsv copy would need to be saved.
Use good null values; missing values are a fact of life:
NA or NULL are also good options
Do not use 0. Avoid numbers like -999
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
Raw data are sacrosanct
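A minimal sketch of "do any manipulations in script", assuming a small hypothetical CSV with a temp_c column. Blank cells, NA and NULL are treated as missing, a genuine 0 is kept as a real measurement, and the raw text itself is never changed.

```python
import csv
import io

# Hypothetical raw data: a blank cell, NA and NULL all mean "missing".
RAW_CSV = "site,temp_c\nA1,12.5\nA2,\nA3,NA\nA4,NULL\nA5,0\n"
MISSING = {"", "NA", "NULL"}


def read_raw(text):
    """Parse raw CSV text into rows, mapping missing markers to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["temp_c"].strip()
        temp = None if value in MISSING else float(value)
        rows.append({"site": row["site"], "temp_c": temp})
    return rows


rows = read_raw(RAW_CSV)
# Derived values are computed in code; RAW_CSV itself is never modified.
observed = [r["temp_c"] for r in rows if r["temp_c"] is not None]
mean_temp = sum(observed) / len(observed)
```

Had the missing cells been recorded as 0 or -999, they would have been indistinguishable from real readings and would have distorted the mean.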
Photo by Jon Moore, unsplash.com
Photo: Pexels CC0
main copy of files
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
NO
myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
YES
2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
What makes a good file name?
In the following:
ls -lh *Plasmid*
*Plasmid*
is a glob.
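A glob is a wildcard pattern: * matches any run of characters, so *Plasmid* selects every name that contains "Plasmid". A minimal Python sketch of the same matching, using the standard library fnmatch module on a hypothetical list of file names:

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = [
    "2021-09-01_Plasmid-assay_A01.csv",
    "2021-09-01_Plasmid-assay_A02.csv",
    "2021-09-02_control_A01.csv",
    "notes-on-plasmids.txt",
]

# fnmatchcase is case-sensitive, so "plasmids" does not match "*Plasmid*".
matches = [name for name in names if fnmatchcase(name, "*Plasmid*")]
```

Consistent file names are what make a short pattern like this reliable.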
Deliberate use of "-" and "_" allows recovery of metadata from the filenames:
"_" underscore used to delimit units of metadata I want to access later
"-" hyphen used to delimit words so our eyes don’t bleed
This happens to be R but also possible in the shell, Python, etc.
e.g. I’m saving a number of files of temperature data extracted at different resolutions (res) and for a number of months (month). Including these parameters in the filename allows me to use them to target files to read in.
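# res and month become part of the file name: variable_<res>_<month>
write.csv(df, paste("variable", res, month, sep = "_"))
# the same parameters rebuild the name to read the file back in
df <- read.csv(paste("variable", res, month, sep = "_"))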
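The same idea sketched in Python. The helper names build_name and parse_name, the example values, and the added .csv extension are assumptions made for illustration; they are not part of the R example.

```python
def build_name(variable, res, month, ext="csv"):
    """Join metadata units with "_" so they can be recovered later."""
    return f"{variable}_{res}_{month:02d}.{ext}"


def parse_name(filename):
    """Recover the metadata units from a name made by build_name."""
    stem = filename.rsplit(".", 1)[0]
    variable, res, month = stem.split("_")
    return {"variable": variable, "res": res, "month": int(month)}


name = build_name("temperature", "5km", 3)
meta = parse_name(name)

# "_" separates units of metadata, "-" separates words within a unit.
stem = "fig01_scatterplot-talk-length-vs-interest"
units = stem.split("_")
words = units[1].split("-")
```

Because "_" is reserved for separating metadata, the split is unambiguous as long as no unit contains an underscore itself.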
The scripts
01_marshal-data.r
02_pre-dea-filtering.r
03_dea-with-limma-voom.r
04_explore-dea-results.r
90_limma-model-term-name-fiasco.r
The figures left behind
02_pre-dea-filtering-preDE-filtering.png
03_dea-with-limma-voom-voom-plot.png
04_explore-dea-results-focus-term-adjusted-p-values1.png
04_explore-dea-results-focus-term-adjusted-p-values2.png
...
90_limma-model-term-name-fiasco-first-voom.png
90_limma-model-term-name-fiasco-second-voom.png
Use the ISO 8601 standard for dates: YYYY-MM-DD
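One practical benefit of YYYY-MM-DD is that plain alphabetical sorting is also chronological sorting. A short Python sketch using two of the file names shown earlier; the DD-MM-YYYY strings are included only as a contrast.

```python
from datetime import date

names = [
    "2014-06-08_abstract-for-sla.docx",
    "1986-01-28_raw-data-from-challenger-o-rings.txt",
]

# Alphabetical order is chronological order for ISO 8601 prefixes.
in_order = sorted(names)

# The date can be recovered from the first "_"-delimited unit.
dates = [date.fromisoformat(name.split("_")[0]) for name in in_order]

# Contrast: day-first strings sort by day, not by time.
day_first = sorted(["28-01-1986", "08-06-2014"])

# Writing a date in ISO 8601 form.
stamp = date(2021, 9, 1).isoformat()
```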
If you don’t left pad, you get this:
10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R
which is just sad :(
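The ordering above is simply what alphabetical sorting does with unpadded numbers. A small Python sketch that reproduces it and then left pads the numeric prefix; the left_pad helper is hypothetical.

```python
unpadded = [
    "1_data-cleaning.R",
    "2_fit-model.R",
    "10_final-figs-for-publication.R",
]


def left_pad(name, width=2):
    """Zero-pad the numeric prefix before the first "_"."""
    number, rest = name.split("_", 1)
    return f"{int(number):0{width}d}_{rest}"


sad_order = sorted(unpadded)
happy_order = sorted(left_pad(name) for name in unpadded)
```

With padding, the alphabetical order matches the intended running order of the scripts.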
Go forth and use awesome file names :)
Where shall I put my data?
myproject/
|
├── 01_data/
| ├── 01_raw/
| ├── 02_working/
| └── 03_clean/
|
├── 02_scripts/
|
├── 03_figures/
|
├── 04_paper/
|
├── 05_presentation/
|
├── readme.md
|
└── license.md
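A layout like this can be created by a short script so that every project starts the same way. A minimal Python sketch, assuming the folder and file names from the tree above; create_project is a hypothetical helper, and the demo writes to a temporary directory.

```python
import tempfile
from pathlib import Path

FOLDERS = [
    "01_data/01_raw",
    "01_data/02_working",
    "01_data/03_clean",
    "02_scripts",
    "03_figures",
    "04_paper",
    "05_presentation",
]
FILES = ["readme.md", "license.md"]


def create_project(root):
    """Create the folder skeleton and empty top-level files under root."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "myproject")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
```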
R (rrtools)
analysis/
|
├── paper/
│ ├── paper.Rmd # this is the main document to edit
│ └── references.bib # this contains the reference list information
│
├── figures/ # location of the figures produced by the Rmd
|
├── data/
│ ├── raw_data/ # data obtained from elsewhere
│ └── derived_data/ # data generated during the analysis
|
└── templates
    ├── journal-of-archaeological-science.csl
    |   # this sets the style of citations & reference list
    ├── template.docx # used to style the output of the paper.Rmd
    └── template.Rmd
Good to include:
Python: requirements.txt, environment.yml etc.
MATLAB: .prj file (xml), question for Mathworks
R: renv.lock - use renv package
Don’t write your own dependency management.
@tomjwebb I see tons of spreadsheets that i don't understand anything (or the stduent), making it really hard to share.
— Erika Berenguer (@Erika_Berenguer) January 16, 2015
“Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).”
Can you think of any examples?
Anything is better than nothing!
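Even a few fields help. A minimal sketch of a metadata record stored as JSON, which stays readable by people and can be loaded by code. The field names and values are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical metadata describing one data file.
metadata = {
    "title": "Hypothetical temperature survey",
    "file": "temperature_5km_03.csv",
    "columns": {
        "site": {"description": "Site identifier"},
        "temp_c": {"description": "Air temperature", "units": "degrees Celsius"},
    },
    "missing_values": ["", "NA"],
}

# Serialise for saving next to the data, then load it back.
text = json.dumps(metadata, indent=2)
restored = json.loads(text)
units = restored["columns"]["temp_c"]["units"]
```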
readme.md - not machine readable
json, yml, xml - can potentially be human and machine readable
A lightly opinionated guide to reproducible data science https://the-turing-way.netlify.com